Load Packages

packages <- c("tidyverse","janitor") 
sapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'janitor'
## 
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
## $tidyverse
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## $janitor
##  [1] "janitor"   "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"
setwd("~/Documents/han-lab/")

1. Data Source

Load Data

ZooScore Dataset

The ZooScore dataset compiles ZooScores assigned to a variety of pathogens and parasites drawn from the Global Mammal Parasite Database (GMPD). The image below shows the decision tree used to assign a ZooScore, which ranges from -1 (a pathogen not found in humans) to 3 (a pathogen capable of human-to-human transmission, e.g., SARS-CoV-2).

Question 1.

How does condition 1 (the pathogen can be acquired from a vertebrate reservoir) differ from condition 2 (it is not transmitted to other humans) and condition 3 (it is transmissible to other humans)? Does condition 1 imply that transmissibility to other humans has not yet been discovered?

Answer: (1) Not possible at the moment. MERS-CoV, Hendra virus.

2. EDA

2-1. Basic Information

The glimpse() call below shows the number of rows and columns, the variable names, and the data type of each variable.

ZooScore Data

df_zs%>% glimpse()
## Rows: 2,008
## Columns: 28
## $ parasite_corrected_name   <chr> "Acanthocephalus anguillae", "Acanthocephalu…
## $ insect                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ genus_only                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ commensal                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ zoo_score                 <chr> "-1", "-1", "-1", "-1", "0", "-1", "2", "-1"…
## $ confidence_score          <dbl> 3, 3, 1, 2, 1, 1, 1, 1, NA, 1, 1, 2, 2, 1, 2…
## $ xc_zoo_score              <dbl> -1, -1, -1, -1, 0, -1, -1, -1, -1, 3, 3, 0, …
## $ xc_c_score                <dbl> 1, 2, 2, 2, 2, 3, 2, 3, 3, 1, 1, 1, 1, 1, 1,…
## $ xc_notes                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ xc_who_by                 <chr> "VR", "VR", "VR", "VR", "VR", "VR", "VR", "V…
## $ xc_date                   <dbl> 42765, 42765, 42765, 42765, 42765, 42765, 42…
## $ pgf_zoo_score             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ pgf_c_score               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ pgf_notes                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ non_gmpd                  <chr> "0", "0", NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ search_string_goog        <chr> "Acanthocephalus anguillae", "Acanthocephalu…
## $ googlehits_as_of_2_8_2017 <dbl> 1410, 431, 217, 65, 713, 52, 404, 82, 64, 65…
## $ search_string_wos         <chr> "Acanthocephalus anguillae", "Acanthocephalu…
## $ wo_shits_as_of_2_6_2017   <dbl> 57, 23, 13, 0, 4, 2, 39, 9, 1, 7371, 2464, 2…
## $ notes                     <chr> NA, NA, "H: dog", "H: primate", NA, "H: Racc…
## $ citation                  <chr> "Kennedy and Moriarty 1987", "Heckmann et al…
## $ print_ref                 <chr> NA, NA, NA, NA, NA, NA, "NEED", NA, NA, NA, …
## $ who_by                    <chr> "VR", "VR", "VR", "VR", "VR", "VR", "VR", "V…
## $ date_entry                <dbl> 42765, 42765, 42765, 42765, 42765, 42765, 42…
## $ xc_citation               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ pgf_citation              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ pgf_more_citations        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ nematode                  <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,…

There are 28 columns and 2,008 rows. Each column is a variable describing a parasite or the ZooScore that investigators assigned to it, and each row should correspond to one parasite. Because parasite_corrected_name acts as the index, the number of rows should equal the number of distinct values of parasite_corrected_name. To verify this, I counted the distinct values.

df_zs %>%
  distinct(parasite_corrected_name) %>% count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  2007

The number of distinct parasite_corrected_name values (2,007) is one less than the number of rows (2,008), so we should check for a data entry issue by listing the names that occur more than once.

df_zs[df_zs$parasite_corrected_name %in% names(which(table(df_zs$parasite_corrected_name) > 1)), ]
## # A tibble: 2 × 28
##   parasite_corrected_name insect genus_only commensal zoo_score confidence_score
##   <chr>                    <dbl>      <dbl> <lgl>     <chr>                <dbl>
## 1 Ascaris suum                NA          0 NA        2                        2
## 2 Ascaris suum                NA          0 NA        1                        1
## # ℹ 22 more variables: xc_zoo_score <dbl>, xc_c_score <dbl>, xc_notes <lgl>,
## #   xc_who_by <chr>, xc_date <dbl>, pgf_zoo_score <dbl>, pgf_c_score <dbl>,
## #   pgf_notes <chr>, non_gmpd <chr>, search_string_goog <chr>,
## #   googlehits_as_of_2_8_2017 <dbl>, search_string_wos <chr>,
## #   wo_shits_as_of_2_6_2017 <dbl>, notes <chr>, citation <chr>,
## #   print_ref <chr>, who_by <chr>, date_entry <dbl>, xc_citation <lgl>,
## #   pgf_citation <chr>, pgf_more_citations <chr>, nematode <dbl>

One parasite, Ascaris suum, appears twice, and the two entries disagree (zoo_score 2 versus 1, confidence_score 2 versus 1). This requires further investigation.
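
The same duplicate check can be written more compactly with janitor, which is already loaded. The sketch below is illustrative only: `toy` and its values are hypothetical stand-ins for df_zs, not rows from the real data.

```r
library(dplyr)
library(janitor)

# Hypothetical stand-in for df_zs (values invented for illustration)
toy <- tibble::tibble(
  parasite_corrected_name = c("A a", "B b", "B b", "C c"),
  zoo_score               = c("-1", "2", "1", "0")
)

# get_dupes() keeps every row whose key occurs more than once
# and adds a dupe_count column
dupes <- toy %>% get_dupes(parasite_corrected_name)
dupes
```

Applied to the real data, the call would be `df_zs %>% get_dupes(parasite_corrected_name)`.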

[is.na()] Let’s check how many missing values each variable has.

purrr::map_dbl(df_zs, ~sum(is.na(.)))
##   parasite_corrected_name                    insect                genus_only 
##                         0                      2004                         1 
##                 commensal                 zoo_score          confidence_score 
##                      2008                       837                       877 
##              xc_zoo_score                xc_c_score                  xc_notes 
##                         0                         1                      2008 
##                 xc_who_by                   xc_date             pgf_zoo_score 
##                         0                         0                      1888 
##               pgf_c_score                 pgf_notes                  non_gmpd 
##                      1887                      1874                       930 
##        search_string_goog googlehits_as_of_2_8_2017         search_string_wos 
##                         0                         0                         0 
##   wo_shits_as_of_2_6_2017                     notes                  citation 
##                         0                      1059                         2 
##                 print_ref                    who_by                date_entry 
##                      1760                         1                         0 
##               xc_citation              pgf_citation        pgf_more_citations 
##                      2008                      1895                      1931 
##                  nematode 
##                      1767
df_zs %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "count") %>%
  ggplot(aes(x = column, y = count)) +
  geom_bar(stat = "identity", fill = "#D31245", width = 0.5) +
  geom_text(aes(label = count), vjust = -0.5, color = "black", size = 2.5) +
  xlab("Column") +
  ylab("Missing Value Count") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Several variables are missing in more than half of the 2,008 rows, in particular insect, commensal, xc_notes, pgf_zoo_score, pgf_c_score, pgf_notes, notes, print_ref, xc_citation, pgf_citation, pgf_more_citations, and nematode.
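
Rather than reading the list off the bar chart, the mostly-missing columns can be selected programmatically. This is a minimal sketch on a hypothetical `toy` table; the 0.5 cut-off is an assumption chosen to mirror "more than half of the rows", not a threshold taken from the analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for df_zs
toy <- tibble::tibble(
  a = c(1, NA, NA, NA),
  b = c("x", "y", NA, "z"),
  c = c(NA, NA, NA, NA)
)

# Share of missing values per column, largest first
missing_share <- toy %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "share_missing") %>%
  arrange(desc(share_missing))

# Columns with more than half of their values missing (assumed cut-off)
mostly_missing <- missing_share %>% filter(share_missing > 0.5)
mostly_missing
```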

Question 2.

Is there another dataset that classifies parasites by family or by their position in a phylogenetic tree? If so, it would help fill in missing information about parasite features such as insect, commensal, and nematode.

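[n_distinct] Let’s look at how many unique values each variable has. Note that n_distinct() counts NA as a value by default, so an all-NA column such as commensal reports 1.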

df_zs %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything(), names_to = "column", values_to = "count") %>%
  ggplot(aes(x = column, y = count)) +
  geom_bar(stat = "identity", fill = "#D31245", width = 0.5) +
  geom_text(aes(label = count), vjust = -0.5, color = "black", size = 2.5) +
  xlab("Column") +
  ylab("Distinct Count") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

purrr::map(df_zs, n_distinct)
## $parasite_corrected_name
## [1] 2007
## 
## $insect
## [1] 2
## 
## $genus_only
## [1] 4
## 
## $commensal
## [1] 1
## 
## $zoo_score
## [1] 17
## 
## $confidence_score
## [1] 5
## 
## $xc_zoo_score
## [1] 6
## 
## $xc_c_score
## [1] 4
## 
## $xc_notes
## [1] 1
## 
## $xc_who_by
## [1] 1
## 
## $xc_date
## [1] 94
## 
## $pgf_zoo_score
## [1] 5
## 
## $pgf_c_score
## [1] 4
## 
## $pgf_notes
## [1] 132
## 
## $non_gmpd
## [1] 4
## 
## $search_string_goog
## [1] 2007
## 
## $googlehits_as_of_2_8_2017
## [1] 939
## 
## $search_string_wos
## [1] 2007
## 
## $wo_shits_as_of_2_6_2017
## [1] 442
## 
## $notes
## [1] 771
## 
## $citation
## [1] 1517
## 
## $print_ref
## [1] 9
## 
## $who_by
## [1] 3
## 
## $date_entry
## [1] 93
## 
## $xc_citation
## [1] 1
## 
## $pgf_citation
## [1] 109
## 
## $pgf_more_citations
## [1] 74
## 
## $nematode
## [1] 3
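
Because NA is counted as a distinct value, the counts above are one higher than the number of real values for every column that contains missing data. A small sketch of the difference, using a hypothetical `toy` table shaped like the insect and commensal columns:

```r
library(dplyr)

# Hypothetical stand-in: one value plus NA, and an all-NA column
toy <- tibble::tibble(
  insect    = c(1, NA, NA, 1),
  commensal = c(NA, NA, NA, NA)
)

# Default behaviour: NA is counted as a value
with_na <- toy %>% summarise(across(everything(), n_distinct))

# na.rm = TRUE counts only non-missing values
without_na <- toy %>%
  summarise(across(everything(), ~ n_distinct(.x, na.rm = TRUE)))

with_na
without_na
```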

[1. parasite_corrected_name]

parasite_corrected_name has 2,008 entries and 2,007 distinct values. As mentioned earlier, each value should identify one row; the single exception is the duplicated Ascaris suum.

df_zs%>%
  select(parasite_corrected_name)%>%
  mutate(parasite_corrected_name = as.factor(parasite_corrected_name))%>%
  summary()
##                      parasite_corrected_name
##  Ascaris suum                    :   2      
##  Acanthocephalus anguillae       :   1      
##  Acanthocephalus ranae           :   1      
##  Acanthocheilonema dracunculoides:   1      
##  Acanthocheilonema gracile       :   1      
##  Acanthocheilonema perstans      :   1      
##  (Other)                         :2001

[2. insect]

Only four rows have a value, and all four are 1 (insect); the remaining 2,004 rows are NA.

df_zs%>%
  select(insect)%>%
   mutate(insect = as.factor(insect))%>%
  summary()
##   insect    
##  1   :   4  
##  NA's:2004
df_zs%>%
  mutate(insect= as.factor(insect))%>%
  ggplot(mapping=aes(x=insect))+
  geom_bar()+
  geom_label(stat='count',
    mapping = aes(label = after_stat(count)),
            color = '#D31245',size = 4, vjust= 0.3 )+
  theme_bw()

[3. genus_only]

This variable answers the question: does the pathogen/parasite entry represent an entire genus?

df_zs%>%
  select(genus_only)%>%
   mutate(genus_only = as.factor(genus_only))%>%
  summary()
##  genus_only 
##  0   :1853  
##  1   :  21  
##  3   : 133  
##  NA's:   1
df_zs%>%
  mutate(genus_only= as.factor(genus_only))%>%
  ggplot(mapping=aes(x=genus_only))+
  geom_bar()+
  geom_label(stat='count',
    mapping = aes(label = after_stat(count)),
            color = '#D31245',size = 4, vjust= 0.3 )+
  theme_bw()

Question 3.

What does each value mean?
  • 0: Is the pathogen/parasite representing the entire genus?
  • 1: Is the pathogen/parasite representing the entire genus?
  • 3: Is the pathogen/parasite representing the entire genus?

[4. commensal]

This variable answers the question: is the organism a commensal, that is, does it live on or in its host without harming it? However, every value is NA.

df_zs%>%
  select(commensal)%>%
   mutate(commensal = as.factor(commensal))%>%
  summary()
##  commensal  
##  NA's:2008

[5. zoo_score]

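The score assigned to the pathogen should range from -1 to 3, as stated in the documentation. However, 60 values fall outside this range, in addition to 837 missing (NA) values. The column is stored as character (<chr>), which suggests that at least some of these entries are not plain numbers.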

df_zs%>%
  select(zoo_score)%>%
   mutate(zoo_score = as.factor(zoo_score))%>%
  summary()
##    zoo_score  
##  -1     :523  
##  0      :316  
##  1      :165  
##  2      : 79  
##  3      : 28  
##  (Other): 60  
##  NA's   :837
df_zs%>%
  mutate(zoo_score= as.factor(zoo_score))%>%
  mutate(zoo_score = fct_lump(zoo_score, n = 5, other_level = "(other)")) %>%
  ggplot(mapping=aes(x=zoo_score))+
  geom_bar()+
  geom_label(stat='count',
    mapping = aes(label = after_stat(count)),
            color = '#D31245',size = 4, vjust= 0.3 )+
  theme_bw()
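
The summary lumps the out-of-range entries into "(Other)", so they still need to be listed individually. A minimal sketch of how that could be done follows; the odd values in `toy` are invented placeholders, since the source does not show what the real 60 entries look like.

```r
library(dplyr)

# Scores allowed by the documentation, as strings because zoo_score is <chr>
valid_scores <- c("-1", "0", "1", "2", "3")

# Hypothetical stand-in for df_zs; "2?" and "1 or 2" are made-up examples
toy <- tibble::tibble(
  zoo_score = c("-1", "0", "2?", "1 or 2", NA, "3", "2?")
)

# Keep non-missing entries that are not a documented score, then tabulate
odd_values <- toy %>%
  filter(!is.na(zoo_score), !zoo_score %in% valid_scores) %>%
  count(zoo_score, sort = TRUE)
odd_values
```

On the real data the same pipeline would start from df_zs instead of `toy`.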

[6. confidence_score]

The score represents the confidence level in the ZooScore, with 1 indicating high confidence and 3 indicating low/no confidence. There are 877 NA values, which should be investigated to determine the underlying reasons. Additionally, there is one data point that exceeds the range, marked as 33, which is likely a typo and should be 3.

df_zs%>%
  select(confidence_score)%>%
   mutate(confidence_score = as.factor(confidence_score))%>%
  summary()
##  confidence_score
##  1   :367        
##  2   :399        
##  3   :364        
##  33  :  1        
##  NA's:877
df_zs%>%
  mutate(confidence_score= as.factor(confidence_score))%>%
    ggplot(mapping=aes(x=confidence_score))+
  geom_bar()+
  geom_label(stat='count',
    mapping = aes(label = after_stat(count)),
            color = '#D31245',size = 4, vjust= 0.3 )+
  theme_bw()
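
If the value 33 is confirmed to be a typo for 3, the correction could look like the sketch below. This is only an illustration on a hypothetical `toy` table: the recode rests on the assumption stated above, and the new column names (`out_of_range`, `confidence_score_fixed`) are invented here so that the original column is left untouched.

```r
library(dplyr)

# Hypothetical stand-in for the confidence_score column
toy <- tibble::tibble(confidence_score = c(1, 2, 33, NA, 3))

cleaned <- toy %>%
  mutate(
    # Flag anything non-missing that is not 1, 2 or 3
    out_of_range = !is.na(confidence_score) & !confidence_score %in% 1:3,
    # Assumed correction: treat 33 as a mistyped 3; NA stays NA
    confidence_score_fixed = if_else(confidence_score == 33, 3, confidence_score)
  )
cleaned
```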

[7. xc_zoo_score]

The xc_zoo_score represents the cross-checked ZooScore after review by multiple individuals. It has the same range as the regular ZooScore. The values in xc_zoo_score appear to be more complete than those in zoo_score, as there are no missing (NA) values. One data point (-2) falls outside the expected range.

df_zs%>%
  select(xc_zoo_score)%>%
   mutate(xc_zoo_score = as.factor(xc_zoo_score))%>%
  summary()
##  xc_zoo_score
##  -2:   1     
##  -1:1402     
##  0 : 335     
##  1 : 136     
##  2 :  70     
##  3 :  64
df_zs%>%
  mutate(xc_zoo_score= as.factor(xc_zoo_score))%>%
  ggplot(mapping=aes(x=xc_zoo_score))+
  geom_bar()+
  geom_label(stat='count',
    mapping = aes(label = after_stat(count)),
            color = '#D31245',size = 4, vjust= 0.3 )+
  theme_bw()

[8. xc_c_score]

The xc_c_score represents the cross-checked confidence score after review by multiple individuals. It has the same range as the regular confidence score. The values in xc_c_score appear to be more complete than those in confidence_score, as there are fewer missing (NA) values (1 versus 877). All non-missing data points are within the expected range.

Question 4.

Are xc_zoo_score and xc_c_score considered the final ZooScore and confidence score after cross-checking by multiple individuals?

df_zs%>%
  select(xc_c_score)%>%
   mutate(xc_c_score = as.factor(xc_c_score))%>%
  summary()
##  xc_c_score
##  1   :693  
##  2   :433  
##  3   :881  
##  NA's:  1
df_zs%>%
  mutate(xc_c_score= as.factor(xc_c_score))%>%
  ggplot(mapping=aes(x=xc_c_score))+
  geom_bar()+
  geom_label(stat='count',
    mapping = aes(label = after_stat(count)),
            color = '#D31245',size = 4, vjust= 0.3 )+
  theme_bw()

[9. xc_notes]

There were no xc_notes entries.

df_zs%>%
  select(xc_notes)%>%
   mutate(xc_notes = as.factor(xc_notes))%>%
  summary()
##  xc_notes   
##  NA's:2008

[10. xc_who_by]

The xc_who_by variable represents the investigator who performed the cross-checking. There was only one unique value, which is ‘VR’.

df_zs%>%
  select(xc_who_by)%>%
   mutate(xc_who_by = as.factor(xc_who_by))%>%
  summary()
##  xc_who_by
##  VR:2008

[11. xc_date]

The xc_date variable represents the date when the cross-checking was performed. The raw values are Excel serial date numbers (e.g., 42765), so they are converted with as.Date() and the origin "1899-12-30", which yields dates in the format %Y-%m-%d. The cross-checking started on 2016-07-11 and its last entry is dated 2017-01-30.

df_zs$xc_date <- as.Date(df_zs$xc_date, origin = "1899-12-30")
df_zs%>%
  select(xc_date)%>%
  summary()
##     xc_date          
##  Min.   :2016-07-11  
##  1st Qu.:2016-10-25  
##  Median :2016-11-17  
##  Mean   :2016-11-15  
##  3rd Qu.:2016-12-15  
##  Max.   :2017-01-30
df_zs %>%
  mutate(the_year = lubridate::year(xc_date),
         the_month = lubridate::month(xc_date)) %>%
  ggplot(mapping = aes(x = the_year, y = 1)) +
  geom_jitter(width = 0.4, height = 0.07, alpha = 0.2, color = '#D31245') +
  theme_bw() +
  theme(panel.grid.minor.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
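
As a quick check of the date conversion, the serial number 42765 from the glimpse() output can be converted on its own. The sketch also shows janitor's excel_numeric_to_date(), which is an alternative to spelling out the origin; using it here is a suggestion, not what the analysis above does.

```r
library(janitor)

# Serial number taken from the xc_date column in the glimpse() output
serial <- 42765

# Base R: Excel serial dates count days from 1899-12-30
base_date <- as.Date(serial, origin = "1899-12-30")

# janitor wrapper around the same conversion (default "modern" date system)
janitor_date <- excel_numeric_to_date(serial)

base_date
janitor_date
```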

[12. pgf variables]

These variables (pgf_zoo_score, pgf_c_score, pgf_notes, pgf_citation, pgf_more_citations) are provided by Pasha, a former lab member. However, it is unclear how meaningful they are in the context of the analysis. Further investigation is needed to determine their significance.

df_zs%>%
  select(pgf_zoo_score)%>%
   mutate(pgf_zoo_score = as.factor(pgf_zoo_score))%>%
  summary()
##  pgf_zoo_score
##  -1  :   9    
##  0   :  41    
##  2   :  56    
##  3   :  14    
##  NA's:1888
df_zs%>%
  select(pgf_c_score)%>%
   mutate(pgf_c_score = as.factor(pgf_c_score))%>%
  summary()
##  pgf_c_score
##  1   :  41  
##  2   :  41  
##  3   :  39  
##  NA's:1887
df_zs%>%
  select(pgf_citation)%>%
   mutate(pgf_citation = as.factor(pgf_citation))%>%
  summary()
##                      pgf_citation 
##  Acha                      :   5  
##  Falkinham III 1996        :   2  
##  Acha et al. vol II pg. 229:   1  
##  Acha Vol I                :   1  
##  Albert & Stevens 2010     :   1  
##  (Other)                   : 103  
##  NA's                      :1895
df_zs%>%
  select(pgf_more_citations)%>%
   mutate(pgf_more_citations = as.factor(pgf_more_citations))%>%
  summary()
##                          pgf_more_citations
##  Abrahamian and Goldstein 2011    :   3    
##  Acha                             :   3    
##  Acha vol 1, pg 199               :   1    
##  Acha vol 3, pg. 64 & Coatney 1971:   1    
##  Acha, vol 3, pg 63+ & Baird 2009 :   1    
##  (Other)                          :  68    
##  NA's                             :1931

[13. non_gmpd]

The non_gmpd variable indicates whether a pathogen was sourced from outside the GMPD (Global Mammal Parasite Database). In the dataset, 30 data points are flagged as not sourced from GMPD and 930 values are missing. One entry contains the text "Meningonema peruzzii transmission", which looks like a note entered in the wrong column.

df_zs%>%
  select(non_gmpd)%>%
   mutate(non_gmpd = as.factor(non_gmpd))%>%
  summary()
##                               non_gmpd   
##  0                                :1047  
##  1                                :  30  
##  Meningonema peruzzii transmission:   1  
##  NA's                             : 930
df_zs%>%
  mutate(non_gmpd= as.factor(non_gmpd))%>%
  ggplot(mapping=aes(x=non_gmpd))+
  geom_bar()+
  geom_label(stat='count',
    mapping = aes(label = after_stat(count)),
            color = '#D31245',size = 4, vjust= 0.3 )+
  theme_bw()

[14. search_string_goog]

The search_string_goog variable holds the exact search string used for the Google Scholar search. Its values appear to be identical to those in the parasite_corrected_name variable.

df_zs%>%
  select(search_string_goog)%>%
   mutate(search_string_goog = as.factor(search_string_goog))%>%
  summary()
##                         search_string_goog
##  Ascaris suum                    :   2    
##  Acanthocephalus anguillae       :   1    
##  Acanthocephalus ranae           :   1    
##  Acanthocheilonema dracunculoides:   1    
##  Acanthocheilonema gracile       :   1    
##  Acanthocheilonema perstans      :   1    
##  (Other)                         :2001

[15. googlehits]

googlehits_as_of_2_8_2017 records how many hits were found in Google Scholar. The distribution is strongly right-skewed: the median is 214.5 but the mean is 12,542.1, pulled upward by a few very large values (maximum 2,650,000).

df_zs%>%
  select(googlehits_as_of_2_8_2017)%>%
    summary()
##  googlehits_as_of_2_8_2017
##  Min.   :      0.0        
##  1st Qu.:     35.0        
##  Median :    214.5        
##  Mean   :  12542.1        
##  3rd Qu.:   2122.5        
##  Max.   :2650000.0
df_zs %>% 
  ggplot(mapping = aes(x = googlehits_as_of_2_8_2017)) +
  geom_rug(linewidth = 1) +
  stat_ecdf(linewidth = 1.2) +
  theme_bw()
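
With a skew this strong, the ECDF on the raw scale is dominated by the largest value. One option is to plot log(1 + hits) instead, which keeps the zero counts. The sketch below uses a hypothetical `toy` data frame whose values are rounded from the summary above; the column name `hits` is invented for illustration.

```r
library(ggplot2)

# Hypothetical stand-in: values rounded from the five-number summary
toy <- data.frame(hits = c(0, 35, 214, 2122, 2650000))

# log1p(x) = log(1 + x), so zero hits remain defined
p <- ggplot(toy, aes(x = log1p(hits))) +
  geom_rug(linewidth = 1) +
  stat_ecdf(linewidth = 1.2) +
  labs(x = "log(1 + hits)", y = "ECDF") +
  theme_bw()
p
```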

[16. search_string_wos]

The search_string_wos variable holds the exact search string used for the Web of Science search. Its values appear to be identical to those in the parasite_corrected_name variable.

df_zs%>%
  select(search_string_wos)%>%
   mutate(search_string_wos = as.factor(search_string_wos))%>%
  summary()
##                         search_string_wos
##  Ascaris suum                    :   2   
##  Acanthocephalus anguillae       :   1   
##  Acanthocephalus ranae           :   1   
##  Acanthocheilonema dracunculoides:   1   
##  Acanthocheilonema gracile       :   1   
##  Acanthocheilonema perstans      :   1   
##  (Other)                         :2001
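
The summaries only show the most frequent values, so they cannot confirm that both search strings match parasite_corrected_name in every row. A row-wise comparison would settle it. The sketch below runs on a hypothetical `toy` table in which one Web of Science string deliberately differs.

```r
library(dplyr)

# Hypothetical stand-in for df_zs; the second WoS string has a double space
toy <- tibble::tibble(
  parasite_corrected_name = c("A a", "B b"),
  search_string_goog      = c("A a", "B b"),
  search_string_wos       = c("A a", "B  b")
)

# Rows where either search string differs from the corrected name
mismatches <- toy %>%
  filter(
    parasite_corrected_name != search_string_goog |
      parasite_corrected_name != search_string_wos
  )
mismatches
```

An empty result on df_zs would confirm that the three columns are identical. Note that `!=` returns NA for missing values and filter() drops those rows, which is harmless here because none of the three columns has missing values.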

[17. wo_shits]

wo_shits_as_of_2_6_2017 records how many hits were found in Web of Science. Its distribution is similar to that of the Google hits: strongly right-skewed (median 5, mean 708.3, maximum 391,855).

df_zs%>%
  select(wo_shits_as_of_2_6_2017)%>%
    summary()
##  wo_shits_as_of_2_6_2017
##  Min.   :     0.0       
##  1st Qu.:     1.0       
##  Median :     5.0       
##  Mean   :   708.3       
##  3rd Qu.:    68.0       
##  Max.   :391855.0
df_zs %>% 
  ggplot(mapping = aes(x = wo_shits_as_of_2_6_2017)) +
  geom_rug(linewidth = 1) +
  stat_ecdf(linewidth = 1.2) +
  theme_bw()

[18. notes]

This includes any notes for the record.

df_zs%>%
  select(notes)%>%
   mutate(notes = as.factor(notes))%>%
  summary()
##          notes     
##  equid      :  19  
##  Nematoda   :  13  
##  Nematode   :  12  
##  Bacterium  :  10  
##  tick vector:  10  
##  (Other)    : 885  
##  NA's       :1059

[19. citation]

This variable represents the citations used to support the ZooScore. The folder where these citations are stored can be shared. It would also be interesting to explore if there are any investigators who are specifically associated with certain pathogens or parasites based on the citations.

df_zs%>%
  select(citation)%>%
   mutate(citation = as.factor(citation))%>%
  summary()
##                      citation   
##  Gideon                  :  87  
##  NEED                    :  51  
##  Stuart et al., 1998     :  18  
##  Irwin and Raharison 2009:  13  
##  Scialdo-Krecek 1983     :  12  
##  (Other)                 :1825  
##  NA's                    :   2

[21. who_by]

This variable represents the investigator who assigned the data points. Nearly all of them were assigned by one investigator, VR; one entry is recorded as "Vr", which is probably a capitalization inconsistency, and one is missing.

df_zs%>%
  select(who_by)%>%
   mutate(who_by = as.factor(who_by))%>%
  summary()
##   who_by    
##  Vr  :   1  
##  VR  :2006  
##  NA's:   1

[22. date_entry]

The date_entry variable represents the date when the ZooScore was assigned. As with xc_date, the raw values are Excel serial date numbers, so they are converted with as.Date() and the origin "1899-12-30". Entries start on 2016-01-08 and the last entry is dated 2017-01-30.

df_zs$date_entry <- as.Date(df_zs$date_entry, origin = "1899-12-30")
df_zs%>%
  select(date_entry)%>%
  summary()
##    date_entry        
##  Min.   :2016-01-08  
##  1st Qu.:2016-10-25  
##  Median :2016-11-17  
##  Mean   :2016-11-14  
##  3rd Qu.:2016-12-15  
##  Max.   :2017-01-30
df_zs %>%
  mutate(the_year = lubridate::year(date_entry),
         the_month = lubridate::month(date_entry)) %>%
  ggplot(mapping = aes(x = the_year, y = 1)) +
  geom_jitter(width = 0.4, height = 0.07, alpha = 0.2, color = '#D31245') +
  theme_bw() +
  theme(panel.grid.minor.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

[23. xc_citation]

No data points available.

df_zs%>%
  select(xc_citation)%>%
   mutate(xc_citation = as.factor(xc_citation))%>%
  summary()
##  xc_citation
##  NA's:2008

[24. nematode]

This variable indicates whether the pathogen/parasite is a nematode (worm).

df_zs%>%
  select(nematode)%>%
   mutate(nematode = as.factor(nematode))%>%
  summary()
##  nematode   
##  0   : 151  
##  1   :  90  
##  NA's:1767
df_zs%>%
  mutate(nematode= as.factor(nematode))%>%
  ggplot(mapping=aes(x=nematode))+
  geom_bar()+
  geom_label(stat='count',
    mapping = aes(label = after_stat(count)),
            color = '#D31245',size = 4, vjust= 0.3 )+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

# Save df_zs to a CSV file
write.csv(df_zs, "df_zs.csv", row.names = FALSE)

2-2. Focus on xc_zoo_score

1. xc_zoo_score and genus_only

The meaning of the values in genus_only is currently unknown: the observed values 0, 1, and 3 have no documented interpretation at this point. However, the facets suggest that pathogens with zoonotic features occur when genus_only is 0 or 3.

df_zs %>% 
  ggplot(mapping = aes(x =xc_zoo_score))+
  geom_bar(mapping = aes(fill = as.factor(xc_zoo_score)))+
  facet_wrap(~genus_only)+
   ggthemes::scale_fill_colorblind('xc_zoo_score')+
  theme_bw()

2. xc_zoo_score and nematode

The high number of missing values limits the interpretability and significance of any zooscore associated with the nematode variable.

df_zs %>% 
  ggplot(mapping = aes(x =xc_zoo_score))+
  geom_bar(mapping = aes(fill = as.factor(xc_zoo_score)))+
  facet_wrap(~nematode)+
   ggthemes::scale_fill_colorblind('xc_zoo_score')+
  theme_bw()

3. xc_zoo_score and xc_c_score

df_zs %>%
  filter(!is.na(xc_zoo_score) & !is.na(xc_c_score))%>%
  count(xc_zoo_score, xc_c_score) %>%
  ggplot(mapping = aes(x=xc_zoo_score, y=xc_c_score))+
  geom_tile(mapping = aes(fill = n),
    color = 'black')+
  geom_label(mapping = aes(label = n,
                          color = n > median(n)),
            size = 2.5)+
  scale_color_manual(guide = 'none', values = c('TRUE' = '#D31245',
                                                "FALSE" = '#091F40'))+
  scale_fill_continuous()+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 20, hjust=1))
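
The counts behind the heat map can also be shown as a plain cross-tabulation with janitor, which is already loaded. This is a sketch on a hypothetical `toy` table, not output from the real data.

```r
library(dplyr)
library(janitor)

# Hypothetical stand-in for the two cross-checked score columns
toy <- tibble::tibble(
  xc_zoo_score = c(-1, -1, 0, 3, 3, 3),
  xc_c_score   = c(1, 3, 2, 1, 1, 2)
)

# Rows: xc_zoo_score, columns: xc_c_score, with row and column totals
cross_tab <- toy %>%
  tabyl(xc_zoo_score, xc_c_score) %>%
  adorn_totals(where = c("row", "col"))
cross_tab
```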

library(corrr)

3. Correlation

The correlation between Google Scholar hits and Web of Science hits is strong.

library(corrplot)
## corrplot 0.92 loaded
cor_matrix <- df_zs[c('googlehits_as_of_2_8_2017', 'wo_shits_as_of_2_6_2017', 'xc_zoo_score', 'xc_c_score')] %>%
  filter(!is.na(googlehits_as_of_2_8_2017) & !is.na(wo_shits_as_of_2_6_2017)) %>%
  cor(use = "pairwise.complete.obs")

corrplot(cor_matrix, type = 'upper', method = 'square',
         order = 'hclust', hclust.method = 'ward.D2')

  • Help: How can I see this correlation by xc_zoo_score?
df_zs[c('googlehits_as_of_2_8_2017', 'wo_shits_as_of_2_6_2017', 'xc_zoo_score', 'xc_c_score')] %>%
  filter(!is.na(googlehits_as_of_2_8_2017) & !is.na(wo_shits_as_of_2_6_2017)) %>%
  correlate(diagonal = 1, quiet = TRUE) %>% 
  stretch() %>% 
  ggplot(mapping = aes(x = x, y = y)) +
      geom_tile(mapping = aes(fill = r), 
            color = 'black') +
  geom_text(mapping = aes(label = round(r, 2)),
            size = 6) +
  coord_equal() +
  scale_fill_gradient2(low = 'red', mid = 'white', high = 'blue',
                       midpoint = 0,
                       limits = c(-1, 1)) +
  labs(x = '', y = '') +
  theme_bw()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
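
One possible answer to the help question above is to compute the correlation separately within each xc_zoo_score group. The sketch below is a suggestion under stated assumptions: `toy`, `google_hits` and `wos_hits` are hypothetical names and values, Spearman's rank correlation is chosen here because the hit counts are heavily skewed, and groups with fewer than three rows return NA because a correlation is not meaningful there.

```r
library(dplyr)

# Hypothetical stand-in for df_zs with shortened column names
toy <- tibble::tibble(
  xc_zoo_score = c(rep(-1, 4), rep(3, 4), 0),
  google_hits  = c(1, 2, 3, 4, 10, 20, 30, 40, 5),
  wos_hits     = c(2, 4, 6, 8, 4, 3, 2, 1, 7)
)

# One correlation per score level; tiny groups are reported as NA
cor_by_score <- toy %>%
  group_by(xc_zoo_score) %>%
  summarise(
    n = n(),
    r = if (n() >= 3) {
      cor(google_hits, wos_hits, method = "spearman")
    } else {
      NA_real_
    },
    .groups = "drop"
  )
cor_by_score
```

On the real data, `google_hits` and `wos_hits` would be replaced by googlehits_as_of_2_8_2017 and wo_shits_as_of_2_6_2017.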

4. Subset with Species Information

Compare data.

df_sub%>% glimpse()
## Rows: 4,350
## Columns: 89
## $ pathogen                                      <chr> "Acanthocephalus anguill…
## $ insect                                        <dbl> NA, NA, NA, NA, NA, NA, …
## $ genus_only                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ commensal                                     <lgl> NA, NA, NA, NA, NA, NA, …
## $ zoo_score                                     <chr> "-1", "-1", "-1", "-1", …
## $ confidence_score                              <dbl> 3, 3, 1, 2, 1, 1, 1, 1, …
## $ xc_zoo_score                                  <dbl> -1, -1, -1, -1, 0, -1, -…
## $ xc_c_score                                    <dbl> 1, 2, 2, 2, 2, 3, 2, 3, …
## $ xc_notes                                      <lgl> NA, NA, NA, NA, NA, NA, …
## $ xc_who_by                                     <chr> "VR", "VR", "VR", "VR", …
## $ xc_date                                       <date> 2017-01-30, 2017-01-30,…
## $ pgf_zoo_score                                 <dbl> NA, NA, NA, NA, NA, NA, …
## $ pgf_c_score                                   <dbl> NA, NA, NA, NA, NA, NA, …
## $ pgf_notes                                     <chr> NA, NA, NA, NA, NA, NA, …
## $ non_gmpd                                      <chr> "0", "0", NA, NA, NA, NA…
## $ search_string_goog                            <chr> "Acanthocephalus anguill…
## $ googlehits_as_of_2_8_2017                     <dbl> 1410, 431, 217, 65, 713,…
## $ search_string_wos                             <chr> "Acanthocephalus anguill…
## $ wo_shits_as_of_2_6_2017                       <dbl> 57, 23, 13, 0, 4, 2, 39,…
## $ notes                                         <chr> NA, NA, "H: dog", "H: pr…
## $ citation                                      <chr> "Kennedy and Moriarty 19…
## $ print_ref                                     <chr> NA, NA, NA, NA, NA, NA, …
## $ who_by                                        <chr> "VR", "VR", "VR", "VR", …
## $ date_entry                                    <date> 2017-01-30, 2017-01-30,…
## $ xc_citation                                   <lgl> NA, NA, NA, NA, NA, NA, …
## $ pgf_citation                                  <chr> NA, NA, NA, NA, NA, NA, …
## $ pgf_more_citations                            <chr> NA, NA, NA, NA, NA, NA, …
## $ nematode                                      <dbl> 0, 0, 0, 1, 1, 1, 1, 1, …
## $ species                                       <chr> NA, NA, NA, NA, NA, NA, …
## $ disease                                       <chr> NA, NA, NA, NA, NA, NA, …
## $ close                                         <dbl> NA, NA, NA, NA, NA, NA, …
## $ nonclose                                      <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector                                        <dbl> NA, NA, NA, NA, NA, NA, …
## $ intermediate                                  <dbl> NA, NA, NA, NA, NA, NA, …
## $ country                                       <chr> NA, NA, NA, NA, NA, NA, …
## $ DOI                                           <chr> NA, NA, NA, NA, NA, NA, …
## $ evidence                                      <lgl> NA, NA, NA, NA, NA, NA, …
## $ evidence_notes                                <lgl> NA, NA, NA, NA, NA, NA, …
## $ source                                        <lgl> NA, NA, NA, NA, NA, NA, …
## $ checked_by                                    <lgl> NA, NA, NA, NA, NA, NA, …
## $ variable                                      <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_transmission                           <dbl> NA, NA, NA, NA, NA, NA, …
## $ zoonotic                                      <dbl> NA, NA, NA, NA, NA, NA, …
## $ type                                          <chr> NA, NA, NA, NA, NA, NA, …
## $ parasite_protozoa                             <dbl> NA, NA, NA, NA, NA, NA, …
## $ bacterium                                     <dbl> NA, NA, NA, NA, NA, NA, …
## $ fungi                                         <dbl> NA, NA, NA, NA, NA, NA, …
## $ virus_rna                                     <dbl> NA, NA, NA, NA, NA, NA, …
## $ virus_dna                                     <dbl> NA, NA, NA, NA, NA, NA, …
## $ parasite_other                                <dbl> NA, NA, NA, NA, NA, NA, …
## $ disease_code                                  <dbl> NA, NA, NA, NA, NA, NA, …
## $ incubation                                    <chr> NA, NA, NA, NA, NA, NA, …
## $ found_worldwide                               <dbl> NA, NA, NA, NA, NA, NA, …
## $ virus_family                                  <chr> NA, NA, NA, NA, NA, NA, …
## $ virus_genus                                   <chr> NA, NA, NA, NA, NA, NA, …
## $ vehicle_eaten_insect_mite_copepod             <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_none                                  <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_respiratory_or_pharyngeal_acquisition <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_water                                 <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_direct_physical_contact               <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_shellfish                             <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_trauma                                <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_dairy_products                        <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_meat_or_poultry                       <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_fecal_oral_human                      <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_fly                                   <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_food                                  <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_sexual_contact                        <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_vegetable_or_fruit                    <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_secretion_blood_or_tissue             <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_amphibian_or_reptile                  <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_snail_earthworm_or_slug               <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_animal_bite                           <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_droplet_dust_or_aerosol               <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_fish                                  <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_soil_or_vegetable_matter              <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_unknown                               <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_breastfeeding                         <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_none                                   <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_tick                                   <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_fly                                    <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_flea                                   <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_louse                                  <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_mite                                   <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_mosquito                               <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_sandfly                                <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_unknown                                <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_midge                                  <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_bug                                    <dbl> NA, NA, NA, NA, NA, NA, …

1. Zoo Score with Zoonotic

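# Count pathogens in each ZooScore x zoonotic combination,
# dropping rows where either value is missing
df_sub %>%
  filter(!is.na(xc_zoo_score), !is.na(zoonotic)) %>%
  count(xc_zoo_score, zoonotic) %>%
  ggplot(mapping = aes(x = as.factor(xc_zoo_score), y = as.factor(zoonotic))) +
  # Heatmap tiles shaded by count
  geom_tile(mapping = aes(fill = n), color = "black") +
  # Print each count; label color flags counts above the median count
  geom_label(mapping = aes(label = n, color = n > median(n)), size = 2.5) +
  # One panel per value of zoonotic
  facet_wrap(~zoonotic) +
  scale_color_manual(guide = "none",
                     values = c("TRUE" = "#D31245", "FALSE" = "#091F40")) +
  scale_fill_continuous() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))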
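
The counts behind a heatmap like this can also be checked as a plain cross-tabulation. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `df_sub` with made-up values, and it assumes the already-loaded janitor package is the preferred way to tabulate.

```r
library(dplyr)
library(janitor)

# Hypothetical stand-in for df_sub; these values are invented for illustration
toy <- tibble::tibble(
  xc_zoo_score = c(-1, -1, 0, 1, 2, 3, 3, NA),
  zoonotic     = c(0, 0, 0, 1, 1, 1, 1, 1)
)

# Cross-tabulate ZooScore (rows) against zoonotic (columns),
# after dropping rows where either value is missing
xtab <- toy %>%
  filter(!is.na(xc_zoo_score), !is.na(zoonotic)) %>%
  tabyl(xc_zoo_score, zoonotic) %>%
  adorn_totals(where = c("row", "col"))

print(xtab)
```

Replacing `toy` with `df_sub` should give the same counts that the tiles display, which makes it easier to spot empty or unexpected combinations.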

2. Zoo Score with RNA virus

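# Count pathogens by ZooScore, RNA-virus flag, and close-contact flag,
# dropping rows where any of the three values is missing
df_sub %>%
  filter(!is.na(xc_zoo_score), !is.na(virus_rna), !is.na(close)) %>%
  count(xc_zoo_score, virus_rna, close) %>%
  ggplot(mapping = aes(x = as.factor(virus_rna), y = as.factor(xc_zoo_score))) +
  # Heatmap tiles shaded by count
  geom_tile(mapping = aes(fill = n), color = "black") +
  # Print each count; label color flags counts above the median count
  geom_label(mapping = aes(label = n, color = n > median(n)), size = 2.5) +
  # One panel per value of close
  facet_wrap(~close) +
  scale_color_manual(guide = "none",
                     values = c("TRUE" = "#D31245", "FALSE" = "#091F40")) +
  scale_fill_continuous() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))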

To be continued.

Acknowledgment

This project has been funded with Federal funds from the National Library of Medicine (NLM), National Institutes of Health (NIH), under cooperative agreement number UG4LM01234 with the University of Massachusetts Chan Medical School, Lamar Soutter Library. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.